We use models to explain the relationship between variables and to make predictions. For now we will focus on linear models (but remember there are other types of models too!).
Today, we will focus on fitting and interpreting models with a continuous outcome variable and a single predictor. To do so, we will use the broom package, which takes the messy output of built-in functions in R and turns them into tidy data frames. As expected, it integrates well with the tidyverse, especially with data wrangling using dplyr.
Today’s data come from a series of atmospheric measurements taken at the Agriculture Exhibition Hall in Beijing from 2013 to 2017. The following variables are in the dataset:
- year, month, day, and hour: when the observation was taken
- PM2.5 and PM10: levels of fine particulate matter in the atmosphere, in \(\mu\)g/m\(^3\)
- SO2, NO2, CO, O3: sulfur dioxide, nitrogen dioxide, carbon monoxide, and ozone concentrations, in \(\mu\)g/m\(^3\), respectively
- TEMP: temperature, in degrees C
- PRES: barometric pressure, in hPa
- DEWP: dew point temperature, in degrees C
- RAIN: precipitation levels, in mm
- wd: wind direction (eight compass directions)
- WSPM: wind speed, in m/s

These data are adapted from a dataset released by Song Xi Chen, as referenced in Zhang, S., et al. (2017), Proc. Royal Soc. A. Let’s explore some of these data.
We can represent relationships between variables using functions. A function in the mathematical sense is the relationship between one or more inputs and an output created from that (those) input(s). Essentially, we can “plug in” the inputs and receive back the output.
For instance, the formula \(y = 3x + 7\) is a function with input \(x\) and output \(y\). When \(x\) is \(5\), the output \(y\) is \(22\):
x <- 5
y <- 3*x + 7
y
## [1] 22
In statistical models, the response variable (dependent variable, or outcome variable) is the variable whose behavior or variation we are trying to understand, and is generally denoted by \(y\). The explanatory variable (independent variable, or predictor variable) is the variable we want to use to explain the variation in the response, generally denoted by \(x\).
The blue line is a function that we can use to obtain predicted values of our response given values of our predictor: that is, the expected value of the response at a given value of the explanatory variable. However, we know that there is variability around our prediction (compare the line to the actual values). The residual is the difference between the observed value of the response in the dataset and the value predicted by the model. It is essentially the vertical distance between each point and the line.
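To make the idea concrete, here is a small sketch on made-up toy data (not the beijing data): we fit a line, obtain the predicted values, and compute the residuals by hand.

```r
# Toy data (made up for illustration; not the beijing dataset)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)
y_hat <- fitted(fit)        # predicted values from the fitted line
resid_manual <- y - y_hat   # residual = observed minus predicted

all.equal(unname(resid_manual), unname(resid(fit)))
## [1] TRUE
```

The manual calculation matches `resid()` exactly, since that is precisely how residuals are defined.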
As previously mentioned, we will be using the broom package. Some of the most commonly used functions are as follows:
- tidy: constructs a tidy data frame summarizing the model’s statistical findings
- glance: constructs a concise one-row summary of the model
- augment: adds columns (e.g., predictions, residuals) to the original data that was modeled

Today, we will focus on making tidy regression output for linear models and glancing at a model summary.
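As a rough sketch of what augment does under the hood, we can attach fitted values and residuals to a data frame with base R alone. This uses the built-in mtcars data (not the beijing data) so it runs anywhere:

```r
# Base-R sketch of augment(): bind predictions and residuals onto the data
fit <- lm(mpg ~ wt, data = mtcars)

augmented <- cbind(mtcars[, c("mpg", "wt")],
                   .fitted = fitted(fit),   # model predictions
                   .resid  = resid(fit))    # observed minus predicted

head(augmented, 3)
```

broom's augment() does this (and more) while returning a tidy tibble, which is why it is the more convenient tool in a tidyverse workflow.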
Let’s create a linear model that predicts the dew point given the current barometric pressure. We first use the lm function to create a linear model object. Let’s create and save a model, mod1, for use in later downstream analyses:
mod1 <- lm(DEWP ~ PRES, data = beijing)
Now let’s take a look at the tidy output for our object as provided by the broom package:
tidy(mod1)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1071. 22.5 47.7 1.68e-302
## 2 PRES -1.06 0.0222 -47.6 1.18e-301
We write the fitted model as follows:
\[\widehat{DEWP} = 1071 - 1.06 PRES\]
That is, the predicted dew point in degrees Celsius is given by 1071 minus 1.06 times the barometric pressure. We can interpret the intercept and slope coefficients as follows:
Intercept: if the barometric pressure is 0, we would expect the dew point to be 1071 degrees Celsius, on average (is this a meaningful quantity to estimate…?)
Slope: for each additional hectopascal of barometric pressure, we would expect the dew point to decrease by 1.06 degrees Celsius
Note that we “expect” these relationships to hold, but there will be some variability!
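For example, we can plug a pressure value into the fitted line by hand. The value 1013 hPa below is a hypothetical input (roughly standard sea-level pressure), and the coefficients are rounded from the tidy output, so this is an approximation:

```r
# Predicted dew point at a hypothetical pressure of 1013 hPa,
# using the rounded coefficients from the fitted model above
pres <- 1013
dewp_hat <- 1071 - 1.06 * pres
dewp_hat
## [1] -2.78
```

With more precise (unrounded) coefficients, predict(mod1, newdata = ...) would give a slightly different value.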
In general, we assume the following underlying simple linear model for a single predictor and a single continuous outcome:
\[y_i = \beta_0 + \beta_1x_i + \epsilon_i\]
\(\beta_0\) and \(\beta_1\) are parameters which aren’t observed (neither is the error, \(\epsilon_i\)). When we estimate these parameters, we denote these estimates with hats: \(\hat{\beta}_0\) and \(\hat{\beta}_1\). We often use ordinary least squares (OLS) to obtain these estimates to get the fitted model
\[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_i\]
Remember that residuals are the differences between the observed response and the predicted response values from the model:
\[\hat{\epsilon_i} = y_i - (\hat{\beta}_0 + \hat{\beta}_1x_i) = y_i - \hat{y}_i\]
OLS selects \(\hat{\beta}_0\) and \(\hat{\beta}_1\) such that the sum of squared residuals is minimized. The residuals are given by the red segments in the graphic below:
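For a single predictor, the OLS estimates have a simple closed form: \(\hat{\beta}_1 = \text{cov}(x, y)/\text{var}(x)\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\). We can verify this against lm() on simulated data (not the beijing data):

```r
# Closed-form OLS estimates for simple linear regression,
# checked against lm() on simulated data
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

b1 <- cov(x, y) / var(x)        # slope: sample covariance over variance
b0 <- mean(y) - b1 * mean(x)    # intercept: line passes through the means

c(b0, b1)
coef(lm(y ~ x))   # lm() recovers the same estimates
```

The agreement is exact (up to floating-point precision), since lm() is minimizing the same sum of squared residuals.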
You might have noticed a p-value corresponding to each of the parameter estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\). In regression models, p-values generally correspond to testing the null hypothesis \(H_0: \beta = 0\) against the alternative hypothesis \(H_1: \beta \neq 0\).
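The statistic column in the tidy output is simply the estimate divided by its standard error (a t-statistic). A quick check using base R's summary() on simulated data (not the beijing data):

```r
# The t-statistic reported for each coefficient is estimate / std.error
set.seed(2)
x <- rnorm(50)
y <- 1 + 0.5 * x + rnorm(50)

tt <- summary(lm(y ~ x))$coefficients   # base-R analogue of tidy()
tt[, "t value"]
tt[, "Estimate"] / tt[, "Std. Error"]   # identical by construction
```

The p-value is then the probability, under the null, of a t-statistic at least this extreme.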
We can also get model-level summaries by using the glance function on our model object:
glance(mod1)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.602 0.601 8.81 2262. 1.18e-301 2 -5391. 10788. 10804.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
\(R^2\) (r.squared in the glance output) tells us the percentage of variability in the response variable that is explained by our predictors. In our example, we see that approximately 60% of the variability in dew point can be explained by its linear relationship with barometric pressure.
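Concretely, \(R^2 = 1 - SS_{res}/SS_{tot}\), where \(SS_{res}\) is the sum of squared residuals and \(SS_{tot}\) is the total sum of squares around the mean. A sketch on simulated data (not the beijing data):

```r
# R^2 by hand: one minus the ratio of residual to total sums of squares
set.seed(3)
x <- rnorm(80)
y <- -1 + 2 * x + rnorm(80)
fit <- lm(y ~ x)

ss_res <- sum(resid(fit)^2)        # variation left unexplained by the model
ss_tot <- sum((y - mean(y))^2)     # total variation in the response

1 - ss_res / ss_tot
summary(fit)$r.squared             # matches the manual calculation
```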
Let’s fit a linear regression model using the categorical variable wd (wind direction), which takes on eight values corresponding to the compass points.
mod2 <- lm(DEWP ~ wd, data = beijing)
tidy(mod2)
## # A tibble: 8 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.93 0.776 7.64 3.92e-14
## 2 wdN -8.17 1.39 -5.86 5.77e- 9
## 3 wdNE -3.84 1.08 -3.56 3.83e- 4
## 4 wdNW -12.6 1.22 -10.2 7.23e-24
## 5 wdS 1.95 1.38 1.41 1.60e- 1
## 6 wdSE 0.312 1.36 0.230 8.18e- 1
## 7 wdSW -2.10 1.21 -1.73 8.33e- 2
## 8 wdW -5.41 1.59 -3.40 6.95e- 4
When a categorical explanatory variable has multiple levels, the levels are encoded as dummy variables. A dummy variable takes the value 1 if the condition it corresponds to is satisfied, and 0 otherwise. For instance, wdS takes the value 1 if the wind comes from the south, and 0 if it does not. Each coefficient describes the expected difference in dew point when the wind comes from that particular direction compared to the baseline level (east).
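We can see this encoding directly with model.matrix(), which shows the design matrix R builds behind the scenes. The short factor below is made up for illustration; its levels are ordered so that "E" is the baseline:

```r
# How R turns a factor into 0/1 dummy variables (treatment coding)
wd <- factor(c("E", "N", "S", "N"),
             levels = c("E", "N", "NE", "NW", "S", "SE", "SW", "W"))

mm <- model.matrix(~ wd)   # one column per non-baseline level, plus intercept
mm
```

Each row has at most one 1 among the dummy columns; a row of all zeros (besides the intercept) corresponds to the baseline level, east.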
Our fitted/predicted model is given by
\[\begin{align*} \widehat{DEWP} &= 5.93 - 8.17N - 3.84NE - 12.55NW + \\ &\mathrel{\phantom{=}} 1.95S + 0.31SE - 2.10SW - 5.41W \end{align*}\] where each predictor is a dummy variable indicating whether the wind is blowing from that direction.
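To obtain a prediction, set the relevant dummy to 1 and the rest to 0. For instance, using the rounded coefficients above, the predicted dew point when the wind blows from the north is the intercept (the east baseline) plus the wdN coefficient:

```r
# Predicted dew point for a north wind: intercept + wdN coefficient
# (rounded coefficients, so this is approximate)
dewp_north <- 5.93 - 8.17
dewp_north
## [1] -2.24
```

For east wind (the baseline), every dummy is 0 and the prediction is just the intercept, 5.93 degrees Celsius.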